##Introduction
As a company that specializes in talent management, we have been asked to identify the top three factors that lead to employee attrition (turnover). Additionally, we have been tasked with creating a model that predicts attrition, as well as a model that predicts monthly income for the corporation's employees.
YouTube: https://youtu.be/k0ob1JmyLf4
#Reading and tidying datasets
##Reading Test data set for Monthly Income
##Removing unnecessary columns from training set and setting all categorical variables to be factors
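A minimal sketch of what this tidying step could look like in base R. The toy data frame, its values, and the `ID` column are hypothetical stand-ins; the columns actually removed in this analysis are not listed in the report. `JobLevel` is converted to a factor because the regression output later in the report shows it being treated as categorical.

```r
# Toy stand-in for the training set; all values are made up for illustration.
training_data <- data.frame(
  ID        = 1:4,                          # hypothetical identifier column
  Age       = c(25, 41, 33, 52),
  Attrition = c("Yes", "No", "No", "Yes"),
  OverTime  = c("Yes", "No", "Yes", "No"),
  JobLevel  = c(1, 3, 2, 5),
  stringsAsFactors = FALSE
)

# Drop a column that carries no predictive information (here: the ID).
training_data$ID <- NULL

# Convert every character column to a factor.
is_chr <- vapply(training_data, is.character, logical(1))
training_data[is_chr] <- lapply(training_data[is_chr], factor)

# JobLevel is stored as a number but is used as a categorical predictor.
training_data$JobLevel <- factor(training_data$JobLevel)

str(training_data)
```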
##Training data EDA
## # A tibble: 6 x 2
## Attrition n
## <fct> <int>
## 1 No 29
## 2 Yes 6
## 3 No 487
## 4 Yes 75
## 5 No 214
## 6 Yes 59
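Raw counts are easier to compare once they are turned into rates. The sketch below does this for the counts printed above. It assumes that each consecutive No/Yes pair belongs to one group; the labels `g1` to `g3` are placeholders, because the grouping column does not appear in the output.

```r
# Counts copied from the tibble above; group labels are placeholders.
counts <- data.frame(
  group     = rep(c("g1", "g2", "g3"), each = 2),
  Attrition = rep(c("No", "Yes"), times = 3),
  n         = c(29, 6, 487, 75, 214, 59)
)

is_yes  <- counts$Attrition == "Yes"
totals  <- tapply(counts$n, counts$group, sum)                   # employees per group
leavers <- tapply(counts$n[is_yes], counts$group[is_yes], sum)   # "Yes" per group

# Share of employees in each group who left.
attrition_rate <- leavers / totals
round(attrition_rate, 3)
```

Under that pairing assumption the three rates come out at roughly 17%, 13%, and 22%, and the counts add up to 870 employees in total.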
Attrition appears to follow a roughly quadratic trend with age: it is high in the late teens and early 20s, levels off in the 30s, and starts picking back up in the 50s.
## Warning: Removed 5 rows containing non-finite values (stat_smooth).
## Warning: Removed 5 rows containing missing values (geom_point).
##Attrition by JobSatisfaction
There seems to be a very strong correlation between JobSatisfaction and attrition rate: the greater the job satisfaction, the lower the likelihood of attrition.
Note that the echo = FALSE parameter was added to the code chunk to prevent printing of the R code that generated the plot.
##Attrition by total working years
Similar to age, the likelihood of attrition appears to decrease as total working years increase.
## Warning: Removed 10 rows containing missing values (geom_point).
##Attrition by Job Role
Sales representatives appear to have a much higher attrition rate.
##Attrition by PercentSalaryHike
There is a very small correlation between percent salary hike and attrition.
##Attrition by hourly rate
There does not appear to be any real correlation between hourly rate and attrition.
## Warning: Removed 12 rows containing missing values (geom_point).
##Attrition by OverTime
Working overtime appears to have a significant impact on attrition rate.
##Attrition by Monthly Income
## Warning: Removed 813 rows containing missing values (geom_point).
##Choosing to test Bayes models
The Bayes models are tested with the factors that had the most impact on attrition: Age, JobSatisfaction, TotalWorkingYears, JobRole, and HourlyRate.
The model is about 87% accurate but low on specificity.
## Confusion Matrix and Statistics
##
##
## No Yes
## No 225 2
## Yes 32 2
##
## Accuracy : 0.8697
## 95% CI : (0.8227, 0.9081)
## No Information Rate : 0.9847
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.08
##
## Mcnemar's Test P-Value : 6.577e-07
##
## Sensitivity : 0.87549
## Specificity : 0.50000
## Pos Pred Value : 0.99119
## Neg Pred Value : 0.05882
## Prevalence : 0.98467
## Detection Rate : 0.86207
## Detection Prevalence : 0.86973
## Balanced Accuracy : 0.68774
##
## 'Positive' Class : No
##
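The headline statistics can be reproduced by hand from the four cell counts. The sketch below does this for the first confusion matrix, assuming the usual caret layout in which rows are predictions and columns are reference labels (the row and column titles are not visible in the output). Under that assumption, specificity rests on only four reference "Yes" cases, which would explain why it moves so much from one run to the next.

```r
# First confusion matrix above; rows = predicted, columns = reference (assumed).
cm <- matrix(c(225, 32, 2, 2), nrow = 2,
             dimnames = list(Prediction = c("No", "Yes"),
                             Reference  = c("No", "Yes")))

# The 'Positive' class is No, so a true positive is a correctly predicted No.
tp <- cm["No", "No"]     # predicted No, reference No
fn <- cm["Yes", "No"]    # predicted Yes, reference No
fp <- cm["No", "Yes"]    # predicted No, reference Yes
tn <- cm["Yes", "Yes"]   # predicted Yes, reference Yes

accuracy    <- (tp + tn) / sum(cm)     # 227 / 261
sensitivity <- tp / (tp + fn)          # 225 / 257
specificity <- tn / (tn + fp)          # 2 / 4
balanced    <- (sensitivity + specificity) / 2

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, balanced = balanced), 4)
```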
## Confusion Matrix and Statistics
##
##
## No Yes
## No 226 1
## Yes 34 0
##
## Accuracy : 0.8659
## 95% CI : (0.8185, 0.9048)
## No Information Rate : 0.9962
## P-Value [Acc > NIR] : 1
##
## Kappa : -0.0075
##
## Mcnemar's Test P-Value : 6.338e-08
##
## Sensitivity : 0.8692
## Specificity : 0.0000
## Pos Pred Value : 0.9956
## Neg Pred Value : 0.0000
## Prevalence : 0.9962
## Detection Rate : 0.8659
## Detection Prevalence : 0.8697
## Balanced Accuracy : 0.4346
##
## 'Positive' Class : No
##
## Confusion Matrix and Statistics
##
##
## No Yes
## No 223 4
## Yes 29 5
##
## Accuracy : 0.8736
## 95% CI : (0.827, 0.9113)
## No Information Rate : 0.9655
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.1883
##
## Mcnemar's Test P-Value : 2.943e-05
##
## Sensitivity : 0.8849
## Specificity : 0.5556
## Pos Pred Value : 0.9824
## Neg Pred Value : 0.1471
## Prevalence : 0.9655
## Detection Rate : 0.8544
## Detection Prevalence : 0.8697
## Balanced Accuracy : 0.7202
##
## 'Positive' Class : No
##
## Confusion Matrix and Statistics
##
##
## No Yes
## No 224 3
## Yes 30 4
##
## Accuracy : 0.8736
## 95% CI : (0.827, 0.9113)
## No Information Rate : 0.9732
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.1577
##
## Mcnemar's Test P-Value : 6.011e-06
##
## Sensitivity : 0.8819
## Specificity : 0.5714
## Pos Pred Value : 0.9868
## Neg Pred Value : 0.1176
## Prevalence : 0.9732
## Detection Rate : 0.8582
## Detection Prevalence : 0.8697
## Balanced Accuracy : 0.7267
##
## 'Positive' Class : No
##
## Confusion Matrix and Statistics
##
##
## No Yes
## No 222 2
## Yes 32 5
##
## Accuracy : 0.8697
## 95% CI : (0.8227, 0.9081)
## No Information Rate : 0.9732
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.1908
##
## Mcnemar's Test P-Value : 6.577e-07
##
## Sensitivity : 0.8740
## Specificity : 0.7143
## Pos Pred Value : 0.9911
## Neg Pred Value : 0.1351
## Prevalence : 0.9732
## Detection Rate : 0.8506
## Detection Prevalence : 0.8582
## Balanced Accuracy : 0.7942
##
## 'Positive' Class : No
##
## [1] 0.8406897
## [1] 0.002048626
## [1] 0.8509524
## [1] 0.002089431
## [1] 0.57255
## [1] 0.002089431
## Confusion Matrix and Statistics
##
##
## No Yes
## No 214 4
## Yes 38 5
##
## Accuracy : 0.8391
## 95% CI : (0.7888, 0.8815)
## No Information Rate : 0.9655
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.1435
##
## Mcnemar's Test P-Value : 3.543e-07
##
## Sensitivity : 0.8492
## Specificity : 0.5556
## Pos Pred Value : 0.9817
## Neg Pred Value : 0.1163
## Prevalence : 0.9655
## Detection Rate : 0.8199
## Detection Prevalence : 0.8352
## Balanced Accuracy : 0.7024
##
## 'Positive' Class : No
##
## Confusion Matrix and Statistics
##
##
## No Yes
## No 211 7
## Yes 37 6
##
## Accuracy : 0.8314
## 95% CI : (0.7804, 0.8748)
## No Information Rate : 0.9502
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.1492
##
## Mcnemar's Test P-Value : 1.232e-05
##
## Sensitivity : 0.8508
## Specificity : 0.4615
## Pos Pred Value : 0.9679
## Neg Pred Value : 0.1395
## Prevalence : 0.9502
## Detection Rate : 0.8084
## Detection Prevalence : 0.8352
## Balanced Accuracy : 0.6562
##
## 'Positive' Class : No
##
## [1] 0.842069
## [1] 0.002061901
## [1] 0.8508076
## [1] 0.002113658
## [1] 0.6040192
## [1] 0.002113658
## [1] 0.8407663
## [1] 0.00207462
## [1] 0.8507683
## [1] 0.002080485
## [1] 0.5916438
## [1] 0.002080485
## [1] 0.849387
## [1] 0.001940469
## [1] 0.8587652
## [1] 0.002087962
## [1] 0.6427007
## [1] 0.002087962
##Best Bayes model
The best Bayes model included Age, JobRole, JobSatisfaction, and OverTime, with an accuracy of about 85%, a sensitivity of about 0.86, and a specificity of about 0.64.
## [1] 0.849387
## [1] 0.001940469
## [1] 0.8587652
## [1] 0.002087962
## [1] 0.6427007
## [1] 0.002087962
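The paired values printed above look like an average and a much smaller spread measure for each of accuracy, sensitivity, and specificity, which is what repeating a random train/test split many times would produce. The sketch below illustrates that procedure under several assumptions: the data are synthetic, the 70/30 split and 100 repetitions are arbitrary, and a logistic regression from base R is swapped in for the Bayes model so that the example needs no extra packages.

```r
set.seed(1)

# Synthetic stand-in data; the real analysis used the attrition training set.
n   <- 400
toy <- data.frame(
  Age      = round(runif(n, 18, 60)),
  OverTime = factor(sample(c("No", "Yes"), n, replace = TRUE))
)
p_leave <- plogis(-1 + 1.2 * (toy$OverTime == "Yes") - 0.04 * (toy$Age - 35))
toy$Attrition <- factor(ifelse(runif(n) < p_leave, "Yes", "No"),
                        levels = c("No", "Yes"))

# One random split: fit on the training part, score on the held-out part.
one_split <- function(data, train_frac = 0.7) {
  idx    <- sample(nrow(data), size = floor(train_frac * nrow(data)))
  train  <- data[idx, ]
  test   <- data[-idx, ]
  fit    <- glm(Attrition ~ Age + OverTime, data = train, family = binomial)
  pred   <- ifelse(predict(fit, test, type = "response") > 0.5, "Yes", "No")
  actual <- as.character(test$Attrition)
  c(accuracy    = mean(pred == actual),
    sensitivity = mean(pred[actual == "No"] == "No"),    # positive class: No
    specificity = mean(pred[actual == "Yes"] == "Yes"))
}

results <- replicate(100, one_split(toy))   # 3 x 100 matrix, one column per split
rowMeans(results)                           # average of each metric
apply(results, 1, var)                      # variance of each metric across splits
```

Averaging over many splits gives a steadier picture than any single confusion matrix, which matters here because the minority class is small.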
##Comparing against KNN Model
## integer(0)
## [1] NA
## [1] 0.8330268
## [1] 0.00218046
## [1] 0.8517895
## [1] 0.002144115
## [1] 0.4511064
## [1] 0.002144115
##Classifying Attrition for Test Data
##EDA for imputing Monthly Income
So far, the highest correlation with monthly income belongs to JobLevel, at .952. Total working years follows at .779, then years at company at .491, age at .485, and years since last promotion at .316.
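Correlations like these can be computed in one pass with `cor()`. The sketch below uses synthetic columns that only borrow the variable names from this report; the values and the way they are generated are assumptions. Note that `cor()` needs numeric input, so `JobLevel` has to be kept numeric for this step even if it is a factor elsewhere.

```r
set.seed(2)

# Synthetic stand-in columns; not the real data.
n   <- 200
toy <- data.frame(TotalWorkingYears = round(runif(n, 0, 40)))
toy$JobLevel      <- pmin(5, 1 + toy$TotalWorkingYears %/% 8)
toy$MonthlyIncome <- 2500 + 3500 * (toy$JobLevel - 1) + rnorm(n, sd = 1200)

# Pearson correlation of each numeric predictor with MonthlyIncome.
predictors <- setdiff(names(toy), "MonthlyIncome")
cors <- sapply(toy[predictors], cor, y = toy$MonthlyIncome)
sort(cors, decreasing = TRUE)
```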
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
##First Model for computing Monthly Incomes
The first model, which regresses MonthlyIncome on JobLevel, has an RMSE of 1410.878.
##
## Call:
## lm(formula = MonthlyIncome ~ JobLevel, data = training_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4642.2 -668.0 -107.3 668.3 4412.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2743.82 69.69 39.37 <2e-16 ***
## JobLevel2 2800.46 99.89 28.04 <2e-16 ***
## JobLevel3 7108.38 130.24 54.58 <2e-16 ***
## JobLevel4 12509.83 177.45 70.50 <2e-16 ***
## JobLevel5 16480.15 219.18 75.19 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1264 on 865 degrees of freedom
## Multiple R-squared: 0.9248, Adjusted R-squared: 0.9244
## F-statistic: 2658 on 4 and 865 DF, p-value: < 2.2e-16
## 2.5 % 97.5 %
## (Intercept) 2607.044 2880.604
## JobLevel2 2604.402 2996.509
## JobLevel3 6852.766 7363.996
## JobLevel4 12161.551 12858.101
## JobLevel5 16049.957 16910.342
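Because JobLevel enters the model as a factor, the intercept is the fitted income at JobLevel 1 and every other coefficient is an offset from that baseline. With a single categorical predictor, these fitted values are simply the mean income within each level. A short worked example using the estimates printed above:

```r
# Estimates copied from the model summary above.
intercept <- 2743.82                       # fitted income at JobLevel 1
offsets   <- c(JobLevel2 = 2800.46, JobLevel3 = 7108.38,
               JobLevel4 = 12509.83, JobLevel5 = 16480.15)

# Fitted MonthlyIncome per level = intercept + that level's offset.
fitted_income <- c(JobLevel1 = intercept, intercept + offsets)
fitted_income
```

For example, the fitted income at JobLevel 2 is 2743.82 + 2800.46 = 5544.28.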
## [1] 1216.151
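RMSE is the square root of the mean squared difference between observed and predicted income. The sketch below shows one way to compute it on held-out rows. The data are synthetic and the 70/30 split is an assumption; the report does not state how its own RMSE values were obtained.

```r
set.seed(3)

# Synthetic stand-in data with a categorical JobLevel.
n   <- 300
toy <- data.frame(JobLevel = factor(sample(1:5, n, replace = TRUE)))
toy$MonthlyIncome <- 2700 + 4000 * (as.integer(toy$JobLevel) - 1) +
  rnorm(n, sd = 1200)

# Hold out 30% of the rows for evaluation.
idx   <- sample(n, size = floor(0.7 * n))
train <- toy[idx, ]
test  <- toy[-idx, ]

fit  <- lm(MonthlyIncome ~ JobLevel, data = train)
pred <- predict(fit, newdata = test)

# Root mean squared error on the held-out rows.
rmse <- sqrt(mean((test$MonthlyIncome - pred)^2))
rmse
```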
##2nd Model adding TotalWorkingYears
Adding TotalWorkingYears gave a lower error (RMSE) of 1365.
##
## Call:
## lm(formula = MonthlyIncome ~ JobLevel + TotalWorkingYears, data = training_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4957.9 -657.8 -134.6 618.2 4525.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2544.901 89.085 28.567 < 2e-16 ***
## JobLevel2 2652.205 107.666 24.634 < 2e-16 ***
## JobLevel3 6820.371 152.732 44.656 < 2e-16 ***
## JobLevel4 11858.212 254.564 46.582 < 2e-16 ***
## JobLevel5 15800.546 289.997 54.485 < 2e-16 ***
## TotalWorkingYears 33.442 9.426 3.548 0.000409 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1256 on 864 degrees of freedom
## Multiple R-squared: 0.9258, Adjusted R-squared: 0.9254
## F-statistic: 2157 on 5 and 864 DF, p-value: < 2.2e-16
## 2.5 % 97.5 %
## (Intercept) 2370.05330 2719.74815
## JobLevel2 2440.88699 2863.52254
## JobLevel3 6520.60145 7120.13957
## JobLevel4 11358.57533 12357.84882
## JobLevel5 15231.36546 16369.72723
## TotalWorkingYears 14.94155 51.94211
## [1] 1203.668
##3rd Model adding age as well
Including the factors most correlated with monthly income (JobLevel, Age, and TotalWorkingYears) gave the lowest RMSE, around 1200.
##
## Call:
## lm(formula = MonthlyIncome ~ JobLevel + Age + TotalWorkingYears,
## data = training_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4947.4 -652.9 -136.8 615.3 4542.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2417.795 197.719 12.228 <2e-16 ***
## JobLevel2 2653.840 107.720 24.636 <2e-16 ***
## JobLevel3 6825.867 152.965 44.624 <2e-16 ***
## JobLevel4 11869.515 255.118 46.525 <2e-16 ***
## JobLevel5 15811.576 290.482 54.432 <2e-16 ***
## Age 4.548 6.316 0.720 0.4716
## TotalWorkingYears 29.545 10.871 2.718 0.0067 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1256 on 863 degrees of freedom
## Multiple R-squared: 0.9259, Adjusted R-squared: 0.9254
## F-statistic: 1797 on 6 and 863 DF, p-value: < 2.2e-16
## 2.5 % 97.5 %
## (Intercept) 2029.728420 2805.86250
## JobLevel2 2442.416087 2865.26407
## JobLevel3 6525.639779 7126.09387
## JobLevel4 11368.789547 12370.24015
## JobLevel5 15241.442578 16381.70959
## Age -7.847635 16.94390
## TotalWorkingYears 8.209551 50.88143
## [1] 1203.884
##Imputing the values for Test Set
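A minimal sketch of how predicted incomes could be attached to a test set and written out. Both data frames are toy examples, the `ID` column and the temporary output file are hypothetical, and the formula is the second model from above; the report does not say which model was used for the final imputation.

```r
# Toy training frame and a test frame that lacks MonthlyIncome.
train <- data.frame(
  JobLevel          = factor(c(1, 1, 2, 2, 3, 3), levels = 1:3),
  TotalWorkingYears = c(1, 4, 6, 10, 12, 20),
  MonthlyIncome     = c(2500, 2900, 5200, 5700, 9500, 10100)
)
test <- data.frame(
  ID                = 101:103,              # hypothetical identifier
  JobLevel          = factor(c(1, 2, 3), levels = 1:3),
  TotalWorkingYears = c(2, 8, 15)
)

# Fit on the training rows, then fill in the missing incomes.
fit <- lm(MonthlyIncome ~ JobLevel + TotalWorkingYears, data = train)
test$MonthlyIncome <- predict(fit, newdata = test)

# Write the ID and the imputed income to a CSV file (temporary path here).
out_file <- tempfile(fileext = ".csv")
write.csv(test[, c("ID", "MonthlyIncome")], out_file, row.names = FALSE)
read.csv(out_file)
```

The test frame must use the same factor levels for JobLevel as the training frame, otherwise `predict()` cannot map the new rows onto the fitted coefficients.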